1 Preliminary

Part 1

This document shows an exemplary downstream analysis on immune gene expression using the SingleCellAlleleExperiment (SCAE) package. The used dataset is under controlled access and not for public usage. The here shown results are part of the masters thesis:

Development of a multi-layer R object to study immune gene expression in spatial (and single-cell) transcriptomics at allele, gene and functional level written by Jonas Schuck.

The project was conducted in the research group ‘Computational Immunology’ lead by Dr. Katharina Imkeller at the Edinger Institute at the University Hospital Frankfurt. Dr. Imkeller provided supervision of the project. Primary corrector is Prof. Dr. Florian Buettner at Goethe University Frankfurt.

This document gives only a brief introduction to the SingleCellAlleleExperiment (SCAE) package as well as an overview for the performed steps described in the package application section of the thesis. For further information conduct the thesis.

2 Introduction

The SingleCellAlleleExperiment (SCAE) class is a container for storing and handling allele-aware quantification data for immune genes. The SCAE class is derived from the SingleCellExperiment(SCE) class uses the same object architecture. However, multiple data layers are integrated into the object during object generation. Data from the allele-aware quantification method contains information about alleles for immune genes. During object generation the allele information is aggregated into two additional data layers via an ontology based MHC design principle and extended to the inital raw data using a lookup table. Thus the final SCAE object contains non-immune genes, alleles for immune genes, an immune gene layer and a functional class layer (refer to Figure 1).

For example the counts of the allele A*01:01:01:01 present in the raw input data will be transformed into the HLA-A immune gene layer and the HLA-class I functional class layer. The information for the lookup table is retrieved from the IPD-IMGT/HLA database.

The implemented object follows similar conventions like the SCE class, where rows should represent features (genes, transcripts) and columns should represent cells. Established single cell packages like scater and scran can be used with the SCAE object to perform downstream analysis on immune gene expression.

alt text here Figure 1: Scheme of SingleCellAlleleExperiment object structure with lookup table.

2.1 Expected input

The read in function of the SCAE package readAlleleCounts expects specific files and file identifiers. The stated input directory should contain the following files:

  • cells_x_genes.barcodes.txt (list of barcodes/cell identifiers)
  • cells_x_genes.features.txt (list of feature identifiers)
  • cells_x_genes.mtx (Contains the quantification matrix)
  • lookup_table.csv (Info for generating multiple data layers)

The used dataset for later downstream analysis and testing of the functionalities of the multi-layer object is a dual-center, two-cohort study where whole blood and peripheral blood mononuclear cells underwent scRNA-sequencing. The whole transcriptome dataset is under controlled access and not for public use. The corresponding publication Severe COVID-19 Is Marked by a Dysregulated Myeloid Cell Compartment can be found here.

3 Loading packages

The following package are abundant for performing the downstream analysis and visualization.

library(SingleCellAlleleExperiment)
library(scran)
library(scater)
library(gridExtra)
library(tidyverse)
library(patchwork)
library(cowplot)
library(ggplot2)

4 Reading in data and quality check

4.1 Stating input directory containing the expected files

First the directory path is stated where the input data is saved.

dir_path <- "./input_files/counts_unfiltered"

list.files("./input_files/counts_unfiltered")
## [1] "cells_x_genes.barcodes.txt" "cells_x_genes.genes.txt"   
## [3] "cells_x_genes.mtx"          "lookup_table_HLA_only.csv"

4.2 Knee plots

Starting the workflow with assessing the quality of the underlying dataset. Prior to filtering out low-quality cells the plotKnee() function of the SCAE package can be used. The function offers two different options for the knee plot. A default one (plot B) and an advanced one (plot A) with a computed knee and inflection point on the curve which pose suggestions for potential threshold for filtering out low-quality cells.

ega_knee_adv <- plotKnee(dir_path, "advanced")
ega_knee_def <- plotKnee(dir_path, "default")

knee <- ega_knee_adv + labs(title = "A") + theme_bw() +
        ega_knee_def + labs(title = "B") + theme_bw() + 
        theme(axis.title.y=element_blank(), legend.position = "none")
knee

In plot A, the advanced knee-plot, the inflection point suggests a threshold for filtering out low-quality cells at about 110 total UMI counts. This threshold is passed to the filter_threshold parameter of the readAlleleCounts() function. The dataset used is a whole transcriptome dataset, which is why exp_type parameter is set to WTA. The alternative would be to state Amplicon. The symbols parameter is for retrieving NCBI gene names for the ensemble gene names present in the raw data with the help of the biomaRt package. This parameter is not abundant to state at function execution as the standard value is biomart. For offline execution, orgdb is suggested, which uses the org.Hs.eg.db package.

4.3 Generate SingleCellAlleleExperiment object

SCAE objects pose the same object architecture as SCE objects.

scae <- readAlleleCounts(dir_path,
                         sample.names = "EGA_WTA",
                         filter_threshold = 110,
                         exp_type = "WTA",
                         symbols = "biomart")
## [1] "Runtime check (1/2) Read_in: 7.39461421966553"
## [1] "     Generating SCAE (1/5) extending rowData: 12.8516063690186"
## [1] "     Generating SCAE (2/5) filtering and normalization: 0.339311838150024"
## [1] "     Generating SCAE (3/5) alleles2genes: 2.75878572463989"
## [1] "     Generating SCAE (4/5) genes2functional: 1.73276138305664"
## [1] "     Generating SCAE (5/5) log_transform: 1.73276138305664"
## [1] "Runtime check (2/2) Generating SCAE completed: 19.5463762283325"
## [1] "Total runtime, completed read_in, filtering and normalization and generating scae object 26.9412634372711"
scae
## class: SingleCellAlleleExperiment 
## dim: 62718 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(62718): ENSG00000160072.20 ENSG00000279928.2 ... HLA_class_I
##   HLA_class_II
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

4.3.1 RowData slot

Two new classification columns are introduced in the rowData slot. Namely the NI_I column and Quant_type columns. Both columns are used to identify each row of the object to its corresponding data layer.

rowData(scae)
## DataFrame with 62718 rows and 4 columns
##                            Ensembl.ID       Symbol        NI_I  Quant_type
##                           <character>  <character> <character> <character>
## ENSG00000160072.20 ENSG00000160072.20       ATAD3B          NI           G
## ENSG00000279928.2   ENSG00000279928.2     DDX11L17          NI           G
## ENSG00000228037.1   ENSG00000228037.1           NA          NI           G
## ENSG00000142611.17 ENSG00000142611.17       PRDM16          NI           G
## ENSG00000284616.1   ENSG00000284616.1           NA          NI           G
## ...                               ...          ...         ...         ...
## HLA-DQA1                     HLA-DQA1     HLA-DQA1           I           G
## HLA-DQB1                     HLA-DQB1     HLA-DQB1           I           G
## HLA-DPB1                     HLA-DPB1     HLA-DPB1           I           G
## HLA_class_I               HLA_class_I  HLA_class_I           I           F
## HLA_class_II             HLA_class_II HLA_class_II           I           F

4.3.2 ColData slot

As the object extends the count matrix during the object generation, its abundant to compute scaling factors on the raw data prior to extending and integrating the data layers. The scaling factors are used for scaling normalization in a later step of the SCAE constructor.

colData(scae)
## DataFrame with 47788 rows and 3 columns
##                                  Sample                Barcode sizeFactor
##                             <character>            <character>  <numeric>
## AAATCCTGTAAACGTACCAAAGTCATT     EGA_WTA AAATCCTGTAAACGTACCAA..   0.361248
## AAATCCTGTAAACGTACCACGTGGGAG     EGA_WTA AAATCCTGTAAACGTACCAC..   0.188150
## AAATCCTGTAAACGTACCTCTCACGGA     EGA_WTA AAATCCTGTAAACGTACCTC..   0.457581
## AAATCCTGTAAACGTACCTGCATAGTA     EGA_WTA AAATCCTGTAAACGTACCTG..   0.871510
## AAATCCTGTAAACTGCGCAACAACGCG     EGA_WTA AAATCCTGTAAACTGCGCAA..   1.846879
## ...                                 ...                    ...        ...
## TTGTTCCAATTGGTATGAATCATTCTG     EGA_WTA TTGTTCCAATTGGTATGAAT..   0.931718
## TTGTTCCAATTGGTATGAATGGGACTC     EGA_WTA TTGTTCCAATTGGTATGAAT..   2.170497
## TTGTTCCAATTGGTATGACATAACGTT     EGA_WTA TTGTTCCAATTGGTATGACA..   1.467569
## TTGTTCCAATTGGTATGACATAGGTCA     EGA_WTA TTGTTCCAATTGGTATGACA..   0.704433
## TTGTTCCAATTGGTATGAGCAACATTA     EGA_WTA TTGTTCCAATTGGTATGAGC..   2.126846

4.3.3 Extract information from the different data layers

Additionally to the established getters from the SCE package, new getters are implemented to retrieve the different data layers integrated in the SCAE object.

4.3.3.1 Non-immune genes

get_nigenes(scae)
## class: SingleCellAlleleExperiment 
## dim: 62695 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(62695): ENSG00000160072.20 ENSG00000279928.2 ...
##   ENSG00000277475.1 ENSG00000275405.1
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
head(rownames(get_nigenes(scae)))
## [1] "ENSG00000160072.20" "ENSG00000279928.2"  "ENSG00000228037.1" 
## [4] "ENSG00000142611.17" "ENSG00000284616.1"  "ENSG00000157911.11"


#### Alleles

get_alleles(scae)
## class: SingleCellAlleleExperiment 
## dim: 14 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(14): A*02:01:01:01 A*26:01:01:01 ... DPB1*10:01:01:01
##   DPB1*13:01:01:01
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
head(rownames(get_alleles(scae)))
## [1] "A*02:01:01:01" "A*26:01:01:01" "B*51:01:01:01" "B*57:01:01:01"
## [5] "C*04:01:01:01" "C*06:02:01:01"


#### Immune genes

get_agenes(scae)
## class: SingleCellAlleleExperiment 
## dim: 7 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(7): HLA-A HLA-B ... HLA-DQB1 HLA-DPB1
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
head(rownames(get_agenes(scae)))
## [1] "HLA-A"    "HLA-B"    "HLA-C"    "HLA-DRB1" "HLA-DQA1" "HLA-DQB1"


#### Functional class

get_func(scae)
## class: SingleCellAlleleExperiment 
## dim: 2 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(2): HLA_class_I HLA_class_II
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
head(rownames(get_func(scae)))
## [1] "HLA_class_I"  "HLA_class_II"


#### Unknown alleles

In case allele information that is stated in the raw data is not found in the lookup table, this marks an potentially invalid allele-identifier. These alleles will be classified accordingly in the Quant_type classification column in the colData slot and can be retrieved with the get_unknown() getter.

get_unknown(scae)
## class: SingleCellAlleleExperiment 
## dim: 0 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(0):
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
head(rownames(get_unknown(scae)))
## character(0)

5 Expression evaluation

Checking the expression for the allele-layer, immune gene layer and functional class layer. Allele identifiers are in the form of A*02:01:01:01. The immune genes are in the form of HLA-A and the functional classes HLA_class_I. Here we see that HLA-class I and HLA-C are the most abundant functional class and immune gene respectively in the underlying dataset.

all_alleles <- c(rownames(get_alleles(scae)),
                 rownames(get_agenes(scae)),
                 rownames(get_func(scae)))

plotExpression(scae, all_alleles)


# Downstream analysis

In the following sections, main steps for dimensional reduction are performed, offering insights into the different data layers of the SCAE object.

5.1 Filtering low abundance features

Do we need to do this?????

scae <- scae[rowSums(counts(scae)) > 0, ]
scae
## class: SingleCellAlleleExperiment 
## dim: 29923 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(29923): ENSG00000160072.20 ENSG00000228037.1 ... HLA_class_I
##   HLA_class_II
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

5.2 Subsetting the different layers

The non-imune genes are combined with each of the integrated immune gene allele-aware layers to determine three different subsets.

5.2.1 Non-immune genes + alleles

redim_scae <- scae

ni_a <- redim_scae[c(rownames(get_nigenes(redim_scae)), rownames(get_alleles(redim_scae))), ]

ni_a
## class: SingleCellAlleleExperiment 
## dim: 29914 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(29914): ENSG00000160072.20 ENSG00000228037.1 ...
##   DPB1*10:01:01:01 DPB1*13:01:01:01
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

5.2.2 Non-immune genes + immune genes

ni_g <- redim_scae[c(rownames(get_nigenes(redim_scae)), rownames(get_agenes(redim_scae))), ]

ni_g
## class: SingleCellAlleleExperiment 
## dim: 29907 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(29907): ENSG00000160072.20 ENSG00000228037.1 ... HLA-DQB1
##   HLA-DPB1
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

5.2.3 Non-immune genes + functional class

ni_f <- redim_scae[c(rownames(get_nigenes(redim_scae)), rownames(get_func(redim_scae))), ]

ni_f
## class: SingleCellAlleleExperiment 
## dim: 29902 47788 
## metadata(0):
## assays(2): counts logcounts
## rownames(29902): ENSG00000160072.20 ENSG00000228037.1 ... HLA_class_I
##   HLA_class_II
## rowData names(4): Ensembl.ID Symbol NI_I Quant_type
## colnames(47788): AAATCCTGTAAACGTACCAAAGTCATT
##   AAATCCTGTAAACGTACCACGTGGGAG ... TTGTTCCAATTGGTATGACATAGGTCA
##   TTGTTCCAATTGGTATGAGCAACATTA
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

5.3 Dimensional Reduction

5.3.1 Model variance and HVGs for all data layers

Using the modelGeneVar() function prior to performing HVGs using getTopHVGs. Both functions are part from the scran package.

df_ni_a <- modelGeneVar(ni_a)

head(df_ni_a)
## DataFrame with 6 rows and 6 columns
##                           mean       total        tech          bio     p.value
##                      <numeric>   <numeric>   <numeric>    <numeric>   <numeric>
## ENSG00000160072.20 0.002933487 0.003856521 0.003383365  4.73157e-04 1.75665e-02
## ENSG00000228037.1  0.000567862 0.000769625 0.000654949  1.14677e-04 4.17255e-03
## ENSG00000142611.17 0.000240115 0.000336065 0.000276939  5.91253e-05 6.49211e-04
## ENSG00000157911.11 0.001207261 0.001281901 0.001392406 -1.10505e-04 8.84075e-01
## ENSG00000269896.2  0.000138593 0.000201914 0.000159847  4.20670e-05 3.67483e-05
## ENSG00000228463.10 0.001309151 0.001739702 0.001509922  2.29781e-04 1.09352e-02
##                            FDR
##                      <numeric>
## ENSG00000160072.20 0.061996901
## ENSG00000228037.1  0.017757513
## ENSG00000142611.17 0.003390449
## ENSG00000157911.11 1.000000000
## ENSG00000269896.2  0.000244906
## ENSG00000228463.10 0.041368136
df_ni_g <- modelGeneVar(ni_g)

head(df_ni_g)
## DataFrame with 6 rows and 6 columns
##                           mean       total        tech          bio     p.value
##                      <numeric>   <numeric>   <numeric>    <numeric>   <numeric>
## ENSG00000160072.20 0.002933487 0.003856521 0.003375378  4.81143e-04 6.77935e-03
## ENSG00000228037.1  0.000567862 0.000769625 0.000653403  1.16223e-04 1.03295e-03
## ENSG00000142611.17 0.000240115 0.000336065 0.000276286  5.97791e-05 8.93833e-05
## ENSG00000157911.11 0.001207261 0.001281901 0.001389119 -1.07218e-04 9.09350e-01
## ENSG00000269896.2  0.000138593 0.000201914 0.000159470  4.24443e-05 2.01703e-06
## ENSG00000228463.10 0.001309151 0.001739702 0.001506358  2.33345e-04 3.65000e-03
##                            FDR
##                      <numeric>
## ENSG00000160072.20 2.39261e-02
## ENSG00000228037.1  4.39626e-03
## ENSG00000142611.17 4.66851e-04
## ENSG00000157911.11 1.00000e+00
## ENSG00000269896.2  1.34456e-05
## ENSG00000228463.10 1.38086e-02
df_ni_f <- modelGeneVar(ni_f)

head(df_ni_f)
## DataFrame with 6 rows and 6 columns
##                           mean       total        tech          bio     p.value
##                      <numeric>   <numeric>   <numeric>    <numeric>   <numeric>
## ENSG00000160072.20 0.002933487 0.003856521 0.003387065  4.69457e-04 1.10875e-02
## ENSG00000228037.1  0.000567862 0.000769625 0.000655665  1.13960e-04 2.06304e-03
## ENSG00000142611.17 0.000240115 0.000336065 0.000277242  5.88225e-05 2.31380e-04
## ENSG00000157911.11 0.001207261 0.001281901 0.001393929 -1.12028e-04 9.07633e-01
## ENSG00000269896.2  0.000138593 0.000201914 0.000160022  4.18921e-05 7.79048e-06
## ENSG00000228463.10 0.001309151 0.001739702 0.001511573  2.28129e-04 6.37534e-03
##                            FDR
##                      <numeric>
## ENSG00000160072.20 3.91194e-02
## ENSG00000228037.1  8.77760e-03
## ENSG00000142611.17 1.20830e-03
## ENSG00000157911.11 1.00000e+00
## ENSG00000269896.2  5.19107e-05
## ENSG00000228463.10 2.41147e-02

Plotting the variance for each data-layer.

#plot variance
par(mfrow=c(1,3))
fit_ni_a <- metadata(df_ni_a)
plot(fit_ni_a$mean, fit_ni_a$var,
     xlab = "mean of log-expression", ylab = "variance of log-expression", main = "NI-A")
points(fit_ni_a$mean, fit_ni_a$var, col = "red", pch = 16)
curve(fit_ni_a$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

fit_ni_g <- metadata(df_ni_g)
plot(fit_ni_g$mean, fit_ni_g$var,
     xlab = "mean of log-expression", ylab = "variance of log-expression", main = "NI-G")
points(fit_ni_g$mean, fit_ni_g$var, col = "red", pch = 16)
curve(fit_ni_g$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

fit_ni_f <- metadata(df_ni_f)
plot(fit_ni_f$mean, fit_ni_f$var,
     xlab = "mean of log-expression", ylab = "variance of log-expression", main = "NI-F")
points(fit_ni_f$mean, fit_ni_f$var, col = "red", pch = 16)
curve(fit_ni_f$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

par(mfrow=c(1,1))

Compute a list of HVGs for each data layer. Return the top 0.1 % HVGs per layer using getTopHVGs.

top_ni_a <- getTopHVGs(df_ni_a, prop = 0.1)
length(top_ni_a)
## [1] 1726
head(top_ni_a, n = 15)
##  [1] "ENSG00000244734.4"  "ENSG00000188536.13" "ENSG00000206172.8" 
##  [4] "ENSG00000143546.10" "ENSG00000147454.14" "ENSG00000115523.17"
##  [7] "ENSG00000090382.7"  "ENSG00000197746.15" "ENSG00000163220.11"
## [10] "ENSG00000162722.9"  "C*04:01:01:01"      "C*06:02:01:01"     
## [13] "ENSG00000019582.17" "ENSG00000223609.11" "ENSG00000087086.15"
top_ni_g <- getTopHVGs(df_ni_g, prop = 0.1)
length(top_ni_g)
## [1] 1740
head(top_ni_g, n = 15)
##  [1] "ENSG00000244734.4"  "ENSG00000188536.13" "ENSG00000206172.8" 
##  [4] "ENSG00000143546.10" "ENSG00000147454.14" "ENSG00000115523.17"
##  [7] "ENSG00000090382.7"  "ENSG00000197746.15" "ENSG00000163220.11"
## [10] "ENSG00000162722.9"  "ENSG00000019582.17" "ENSG00000087086.15"
## [13] "ENSG00000223609.11" "ENSG00000158869.11" "ENSG00000133742.14"
top_ni_f <- getTopHVGs(df_ni_f, prop = 0.1)
length(top_ni_f)
## [1] 1718
head(top_ni_f, n = 15)
##  [1] "ENSG00000244734.4"  "ENSG00000188536.13" "ENSG00000206172.8" 
##  [4] "ENSG00000143546.10" "ENSG00000147454.14" "ENSG00000115523.17"
##  [7] "ENSG00000090382.7"  "ENSG00000197746.15" "ENSG00000163220.11"
## [10] "ENSG00000162722.9"  "ENSG00000019582.17" "ENSG00000087086.15"
## [13] "ENSG00000223609.11" "ENSG00000158869.11" "ENSG00000133742.14"

5.3.2 PCA

Compute PCA for each layer and store the results in the object. Its Important to make unique identifiers for each layer/run or the results will be overwritten and just saved as PCA. Here, the runPCA functions from the scater package is used.

set.seed(18)
redim_scae <- runPCA(redim_scae, ncomponents = 10, subset_row = top_ni_a, exprs_values = "logcounts", name = "PCA_a")
redim_scae <- runPCA(redim_scae, ncomponents = 10, subset_row = top_ni_g, exprs_values = "logcounts", name = "PCA_g")
redim_scae <- runPCA(redim_scae, ncomponents = 10, subset_row = top_ni_f, exprs_values = "logcounts", name = "PCA_f")
reducedDimNames(redim_scae)
## [1] "PCA_a" "PCA_g" "PCA_f"

5.3.3 t-SNE

The same goes for running t-SNE with the runTSNE function from the scater package. Unique identifiers are stated here for each layer as well.

set.seed(18)
redim_scae <- runTSNE(redim_scae, dimred= "PCA_a", perplexity = 50,  name = "TSNE_a_50")
set.seed(18)
redim_scae <- runTSNE(redim_scae, dimred= "PCA_g", perplexity = 50,  name = "TSNE_g_50")
set.seed(18)
redim_scae <- runTSNE(redim_scae, dimred= "PCA_f", perplexity = 50,  name = "TSNE_f_50")

List of results from the performed reduced dimension analysis.

reducedDimNames(redim_scae)
## [1] "PCA_a"     "PCA_g"     "PCA_f"     "TSNE_a_50" "TSNE_g_50" "TSNE_f_50"

5.3.4 Save object

saveRDS(redim_scae, file = "example_workflow_t_110_pca_tsne_perplex.rds")
redim_scae <- readRDS(file = "~/Work/R/Masterthesis/dev/SCAE/example_workflow_t_110_pca_tsne_perplex.rds")

5.4 Visualization

Exemplary visualization for the t-SNE results on gene level for immune genes that relate to HLA-class I. In the given datasets, these are the immune genes HLA-A, HLA-B and HLA-C plotted alongside their alleles. This allows for insigts into potential genetic differences shown on allele-level.


5.4.1 HLA-A immune gene and alleles

which_tsne <- "TSNE_g_50"

tsne_g_a  <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "HLA-A")
tsne_g_a1 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "A*02:01:01:01")
tsne_g_a2 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "A*26:01:01:01")

p2 <- tsne_g_a + tsne_g_a1 + tsne_g_a2
p2


5.4.2 HLA-B immune gene and alleles

tsne_g_b  <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "HLA-B")
tsne_g_b1 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "B*51:01:01:01")
tsne_g_b2 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "B*57:01:01:01")

p3 <- tsne_g_b + tsne_g_b1 + tsne_g_b2
p3


5.4.3 HLA-C immune gene and alleles

which_tsne <- "TSNE_g_50"

tsne_g_c  <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "HLA-C")
tsne_g_c1 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "C*04:01:01:01")
tsne_g_c2 <- plotReducedDim(redim_scae, dimred = which_tsne, colour_by = "C*06:02:01:01")

p1 <- tsne_g_c + tsne_g_c1 + tsne_g_c2
p1

6 Session Information

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] cowplot_1.1.1                         patchwork_1.1.3                      
##  [3] lubridate_1.9.2                       forcats_1.0.0                        
##  [5] stringr_1.5.0                         dplyr_1.1.2                          
##  [7] purrr_1.0.2                           readr_2.1.4                          
##  [9] tidyr_1.3.0                           tibble_3.2.1                         
## [11] tidyverse_2.0.0                       gridExtra_2.3                        
## [13] scater_1.22.0                         ggplot2_3.4.3                        
## [15] scran_1.22.1                          scuttle_1.4.0                        
## [17] SingleCellAlleleExperiment_0.0.0.9000 SingleCellExperiment_1.16.0          
## [19] SummarizedExperiment_1.24.0           Biobase_2.54.0                       
## [21] GenomicRanges_1.46.1                  GenomeInfoDb_1.30.1                  
## [23] IRanges_2.28.0                        S4Vectors_0.32.4                     
## [25] BiocGenerics_0.40.0                   MatrixGenerics_1.6.0                 
## [27] matrixStats_1.0.0                     BiocStyle_2.22.0                     
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.16                ggbeeswarm_0.7.2         
##   [3] colorspace_2.1-0          bluster_1.4.0            
##   [5] XVector_0.34.0            BiocNeighbors_1.12.0     
##   [7] rstudioapi_0.15.0         farver_2.1.1             
##   [9] ggrepel_0.9.3             bit64_4.0.5              
##  [11] AnnotationDbi_1.56.2      fansi_1.0.4              
##  [13] xml2_1.3.5                R.methodsS3_1.8.2        
##  [15] sparseMatrixStats_1.6.0   cachem_1.0.8             
##  [17] knitr_1.43                jsonlite_1.8.7           
##  [19] cluster_2.1.4             dbplyr_2.3.3             
##  [21] R.oo_1.25.0               png_0.1-8                
##  [23] HDF5Array_1.22.1          BiocManager_1.30.22      
##  [25] compiler_4.1.2            httr_1.4.6               
##  [27] dqrng_0.3.0               Matrix_1.6-1             
##  [29] fastmap_1.1.1             limma_3.50.3             
##  [31] cli_3.6.1                 BiocSingular_1.10.0      
##  [33] htmltools_0.5.6           prettyunits_1.1.1        
##  [35] tools_4.1.2               rsvd_1.0.5               
##  [37] igraph_1.5.1              gtable_0.3.3             
##  [39] glue_1.6.2                GenomeInfoDbData_1.2.7   
##  [41] rappdirs_0.3.3            Rcpp_1.0.11              
##  [43] jquerylib_0.1.4           rhdf5filters_1.6.0       
##  [45] vctrs_0.6.3               Biostrings_2.62.0        
##  [47] DelayedMatrixStats_1.16.0 xfun_0.40                
##  [49] beachmat_2.10.0           timechange_0.2.0         
##  [51] lifecycle_1.0.3           irlba_2.3.5.1            
##  [53] statmod_1.5.0             XML_3.99-0.14            
##  [55] org.Hs.eg.db_3.14.0       edgeR_3.36.0             
##  [57] zlibbioc_1.40.0           scales_1.2.1             
##  [59] hms_1.1.3                 parallel_4.1.2           
##  [61] rhdf5_2.38.1              yaml_2.3.7               
##  [63] curl_5.0.2                memoise_2.0.1            
##  [65] sass_0.4.7                biomaRt_2.50.3           
##  [67] stringi_1.7.12            RSQLite_2.3.1            
##  [69] highr_0.10                ScaledMatrix_1.2.0       
##  [71] filelock_1.0.2            BiocParallel_1.28.3      
##  [73] rlang_1.1.1               pkgconfig_2.0.3          
##  [75] bitops_1.0-7              evaluate_0.21            
##  [77] lattice_0.21-8            Rhdf5lib_1.16.0          
##  [79] labeling_0.4.2            bit_4.0.5                
##  [81] tidyselect_1.2.0          magrittr_2.0.3           
##  [83] R6_2.5.1                  generics_0.1.3           
##  [85] metapod_1.2.0             DelayedArray_0.20.0      
##  [87] DBI_1.1.3                 pillar_1.9.0             
##  [89] withr_2.5.0               KEGGREST_1.34.0          
##  [91] RCurl_1.98-1.12           crayon_1.5.2             
##  [93] DropletUtils_1.14.2       utf8_1.2.3               
##  [95] BiocFileCache_2.2.1       tzdb_0.4.0               
##  [97] rmarkdown_2.24            viridis_0.6.4            
##  [99] progress_1.2.2            locfit_1.5-9.8           
## [101] grid_4.1.2                blob_1.2.4               
## [103] digest_0.6.33             R.utils_2.12.2           
## [105] munsell_0.5.0             beeswarm_0.4.0           
## [107] viridisLite_0.4.2         vipor_0.4.5              
## [109] bslib_0.5.1